{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Natural Language Processing (NLP) Introduction\n",
    "-------------------\n",
    "\n",
    "In this chapter we cover the following topics:\n",
    " - Working with Bag of Words\n",
    " - Implementing TF-IDF\n",
    " - Working with Skip-gram Embeddings\n",
    " - Working with CBOW Embeddings\n",
    " - Making Predictions with Word2vec\n",
    " - Using Doc2vec for Sentiment Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
     "Up to this point, we have only considered machine learning algorithms that operate on numerical inputs.  If we want to use text, we must find a way to convert the text into numbers.  There are many ways to do this, and we will explore a few of the most common.\n",
    "\n",
     "If we consider the sentence **“tensorflow makes machine learning easy”**, we could convert the words to numbers in the order we observe them.  The sentence would then become “1 2 3 4 5”.  Then, when we see a new sentence, **“machine learning is easy”**, we can translate it as “3 4 0 5”, denoting words we haven’t seen before with the index zero.  With these two examples, we have limited our vocabulary to six indices: five words plus the reserved zero index.  With large texts we can choose how many words to keep, usually keeping the most frequent words and labeling everything else with the index zero.\n",
    "\n",
     "If the word “learning” has a numerical value of 4 and the word “makes” has a numerical value of 2, it might seem natural to assume that “learning” is twice “makes”.  Since we do not want this kind of numerical relationship between words, we treat these numbers as categorical labels rather than quantities.\n",
     "Another problem is that the two sentences have different lengths, while every observation (a sentence, in this case) needs to present the same size input to the model we wish to create.  To get around this, we convert each sentence into a sparse vector that holds a one at a given index if the word with that index occurs in the sentence."
   ]
  },
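  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The mapping above can be sketched in a few lines of Python.  The variable names and helper function below are illustrative, not taken from this book's code: we build a vocabulary from the first sentence and reserve index zero for unseen words."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: build a word-to-index vocabulary from the first\n",
    "# sentence, reserving index 0 for words we have not seen before.\n",
    "sentence1 = 'tensorflow makes machine learning easy'\n",
    "sentence2 = 'machine learning is easy'\n",
    "\n",
    "vocab = {word: ix + 1 for ix, word in enumerate(sentence1.split())}\n",
    "\n",
    "def to_indices(sentence, vocab):\n",
    "    # Unknown words map to the reserved index 0\n",
    "    return [vocab.get(word, 0) for word in sentence.split()]\n",
    "\n",
    "print(to_indices(sentence1, vocab))  # [1, 2, 3, 4, 5]\n",
    "print(to_indices(sentence2, vocab))  # [3, 4, 0, 5]"
   ]
  },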
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| word --> | tensorflow | makes | machine | learning | easy |\n",
    "|:----:|:-----:|:-----:|:-----:|:-----:|:-----:|\n",
    "| word index --> | 1 | 2 | 3 | 4 | 5 |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The occurrence vector would then be:\n",
    "\n",
    "    sentence1 = [0, 1, 1, 1, 1, 1]\n",
    "\n",
     "This is a vector of length 6 because we have 5 words in our vocabulary and we reserve the 0-th index for unknown or rare words.\n",
    "\n",
    "Now consider the sentence, **'machine learning is easy'**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "| word --> | machine | learning | is | easy |\n",
    "|:----:|:-----:|:-----:|:-----:|:-----:|\n",
    "| word index --> | 3 | 4 | 0 | 5 |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The occurrence vector for this sentence is now:\n",
    "\n",
    "    sentence2 = [1, 0, 0, 1, 1, 1]\n",
    "\n",
    "Notice that we now have a procedure that converts any sentence to a fixed length numerical vector."
   ]
  },
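  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check, here is a minimal sketch of this procedure.  The function name and vocabulary dictionary are illustrative assumptions rather than this book's code: we place a one at every vocabulary index whose word appears in the sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: encode a sentence as a fixed-length occurrence\n",
    "# vector of length len(vocab) + 1 (index 0 is reserved for unknowns).\n",
    "vocab = {'tensorflow': 1, 'makes': 2, 'machine': 3, 'learning': 4, 'easy': 5}\n",
    "\n",
    "def occurrence_vector(sentence, vocab):\n",
    "    vec = [0] * (len(vocab) + 1)\n",
    "    for word in sentence.split():\n",
    "        # Set (not increment) the slot: we record occurrence, not counts\n",
    "        vec[vocab.get(word, 0)] = 1\n",
    "    return vec\n",
    "\n",
    "print(occurrence_vector('tensorflow makes machine learning easy', vocab))  # [0, 1, 1, 1, 1, 1]\n",
    "print(occurrence_vector('machine learning is easy', vocab))  # [1, 0, 0, 1, 1, 1]"
   ]
  },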
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
     "A disadvantage of this method is that we lose any indication of word order.  The two sentences “tensorflow makes machine learning easy” and “machine learning makes tensorflow easy” would result in the same sentence vector.\n",
     "It is also worth noting that the length of these vectors is equal to the size of the vocabulary we pick.\n",
     "It is common to pick a very large vocabulary, so these sentence vectors can be very sparse.  The type of embedding covered in this introduction is called “bag of words”.  We will implement it in the next section.\n",
     "\n",
     "Another drawback is that the words “is” and “tensorflow” each contribute the same value of one to the sentence vector, even though we can imagine that the occurrence of “is” is less important than the occurrence of “tensorflow”.\n",
     "We will explore different types of embeddings in this chapter that attempt to address these issues, but first we start with an implementation of bag of words."
   ]
  }
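,
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loss of word order is easy to demonstrate with the short check below (an illustrative sketch, not this book's code): the two sentences contain exactly the same words, so any order-insensitive encoding assigns them the same vector."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The two sentences differ only in word order, so their word multisets\n",
    "# (and therefore their bag-of-words vectors) are identical.\n",
    "s1 = 'tensorflow makes machine learning easy'\n",
    "s2 = 'machine learning makes tensorflow easy'\n",
    "print(sorted(s1.split()) == sorted(s2.split()))  # True"
   ]
  }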
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3.0
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}